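With the features mostly in place, one step remains. Because we merged the Train and Test datasets earlier, we now need to split the combined data back into its original parts and do some simple outlier handling.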
# y is the training target, so len(y) is the number of training rows
X = final_features.iloc[:len(y), :]
X_sub = final_features.iloc[len(y):, :]
X.shape, y.shape, X_sub.shape
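To make the positional split concrete, here is a minimal sketch on toy data. The frames `train` and `test`, the column names, and the `_toy` variables are hypothetical stand-ins, not the real dataset; the sketch assumes the training rows were concatenated first, as the split above implies.

```python
import pandas as pd

# Hypothetical toy data standing in for the real Train/Test sets.
train = pd.DataFrame({"area": [50, 60, 70], "price": [100, 120, 140]})
test = pd.DataFrame({"area": [55, 65]})

# The target exists only for the training rows.
y_toy = train["price"]

# Merge the feature columns, training rows first, as done before feature engineering.
combined = pd.concat([train.drop(columns="price"), test]).reset_index(drop=True)

# Split by position: the first len(y_toy) rows are the training rows.
X_toy = combined.iloc[:len(y_toy), :]
X_sub_toy = combined.iloc[len(y_toy):, :]

print(X_toy.shape, y_toy.shape, X_sub_toy.shape)  # (3, 1) (3,) (2, 1)
```

The split only works if the row order has not changed since the merge, so any sorting or shuffling of the combined frame would need to be undone first.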
While exploring the data, we can identify the outliers by their row positions:
outliers = [30, 88, 462, 631, 1322]
X = X.drop(X.index[outliers])
y = y.drop(y.index[outliers])
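Note that `X.index[outliers]` turns row positions into index labels before dropping, which matters when the index does not start at 0. A small sketch with made-up values (the `_toy` and `_clean` names are hypothetical) shows that the same positions removed from both objects keep them aligned:

```python
import pandas as pd

# Toy features and target sharing a non-default index (labels 10..13).
X_toy = pd.DataFrame({"area": [50, 60, 700, 65]}, index=[10, 11, 12, 13])
y_toy = pd.Series([100, 120, 90, 130], index=[10, 11, 12, 13])

# Position 2 (label 12) is the row we treat as an outlier here.
outlier_positions = [2]

# index[positions] maps positions to labels, which is what drop() expects.
X_clean = X_toy.drop(X_toy.index[outlier_positions])
y_clean = y_toy.drop(y_toy.index[outlier_positions])

print(list(X_clean.index))  # [10, 11, 13]
print(X_clean.index.equals(y_clean.index))  # True
```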
overfit = []
# Drop features whose values are almost all 0 (one value covers more than 99.94% of rows)
for i in X.columns:
    counts = X[i].value_counts()
    zeros = counts.iloc[0]  # count of the most frequent value, usually 0
    if zeros / len(X) * 100 > 99.94:
        overfit.append(i)
X = X.drop(overfit, axis=1)
X_sub = X_sub.drop(overfit, axis=1)
overfit
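The same check can be wrapped in a small helper so the threshold is easy to experiment with. This is only a sketch: the function name `dominant_value_columns`, the toy frame, and the 80% threshold used in the demo are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def dominant_value_columns(df, threshold_pct=99.94):
    """Return columns whose most frequent value covers more than threshold_pct of rows."""
    flagged = []
    for col in df.columns:
        # value_counts() sorts in descending order, so the first entry is the top value.
        counts = df[col].value_counts()
        if counts.empty:  # column with only missing values: nothing to count
            continue
        top_share = counts.iloc[0] / len(df) * 100
        if top_share > threshold_pct:
            flagged.append(col)
    return flagged

# Toy frame: one column is 90% zeros, the other has all distinct values.
toy = pd.DataFrame({
    "mostly_zero": [0] * 9 + [1],
    "varied": list(range(10)),
})

print(dominant_value_columns(toy, threshold_pct=80))  # ['mostly_zero']
```

Because the check looks at the most frequent value rather than at 0 specifically, it would also flag a column that is almost entirely some other constant.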
Finally, a quick look at the shapes of the cleaned data:
X.shape, y.shape, X_sub.shape